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^D We present a unified framework to study graph kernels, special cases of which include 

2? the random walk graph kernel (Gartner et al., 2003; Borgwardt et al., 2005), marginal- 

ized graph kernel (Kashima et al., 2003, 2004; Mahe et al., 2004), and geometric kernel 
on graphs (Gartner, 2002). Through extensions of linear algebra to Reproducing Kernel 
Hilbert Spaces (RKHS) and reduction to a Sylvester equation, we construct an algorithm 
^^ that improves the time complexity of kernel computation from 0{n^) to Oijr'). When 

C^ the graphs are sparse, conjugate gradient solvers or fixed-point iterations bring our algo- 

rithm into the sub-cubic domain. Experiments on graphs from bioinformatics and other 
application domains show that it is often more than a thousand times faster than previous 
approaches. We then explore connections between diffusion kernels (Kondor and Lafferty, 
2002), regularization on graphs (Smola and Kondor, 2003), and graph kernels, and use 
these connections to propose new graph kernels. Finally, we show that rational kernels 
(Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to the random walk 
graph kernel. 
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1. Introduction 

We begin by providing some background, establishing some basic notation, and giving an 
outline of the paper. 

1.1 Background 

Machine learning in domains such as bioinformatics (Sharan and Ideker, 2006), chemoin- 
formatics (Bonchev and Rouvray, 1991), drug discovery (Kubinyi, 2003), web data mining 
(Washio and Motoda, 2003), and social networks (Kumar et al., 2006) involves the study 
of relationships between structured objects. Graphs are natural data structures to model 
such structures, with nodes representing objects and edges the relations between them. In 
this context, one often encounters two questions: "How similar are two nodes in a given 
graph?" and "How similar are two graphs to each other?" 

Kernel methods (Scholkopf and Smola, 2002) offer a natural framework to study these 
questions. Roughly speaking, a kernel k{x^x') is a measure of similarity between objects 
X and x' . It must satisfy two mathematical requirements: it must be symmetric, that is, 
k{x^x') — k{x\x), and positive semi-definite (p.s.d.). Comparing nodes in a graph involves 
constructing a kernel between nodes, while comparing graphs involves constructing a kernel 
between graphs. In both cases, the challenge is to define a kernel that captures the semantics 
inherent in the graph structure but at the same time is reasonably efficient to evaluate. 

Until now, the above two types of kernels have largely been studied separately. The 
idea of constructing kernels on graphs (i.e., between the nodes of a single graph) was 
first proposed by Kondor and Lafferty (2002), and extended by Smola and Kondor (2003). 
Kernels between graphs were proposed by Gartner (2002) (geometric kernels on graphs) 
and Gartner et al. (2003) (random walk graph kernels), and later extended by Borgwardt 
et al. (2005). Much at the same time, the idea of marginalized kernels (Tsuda et al., 2002) 
was extended to graphs by Kashima et al. (2003, 2004), and further refined by Mahe et al. 
(2004). A seemingly independent line of research investigates the so-called rational kernels, 
which are kernels between finite state automata based on the algebra of abstract semirings 
(Cortes et al., 2004, 2003, 2002). 

The aim of this paper is twofold: on one hand we present theoretical results showing 
that these four strands of research are in fact closely related, on the other we present 
new algorithms for efficiently computing kernels between graphs. Towards this end we first 
establish some notation and review pertinent concepts from linear algebra and graph theory. 

1.2 Linear Algebra Concepts 

We use e^ to denote the z*^ standard basis vector (that is, a vector of all zeros with the i^^ 
entry set to one), e to denote a vector with all entries set to one, to denote the vector 
of all zeros, and I to denote the identity matrix. When it is clear from context we will not 
mention the dimensions of these vectors and matrices. 
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Definition 1 Given real matrices A G MJ^^^ and B G W^^, the Kronecker product A^B G 
^npxmq ^^^ coluvfin- stacking operator Yec{A) G M^^ are defined as 



A0B : = 



' AiiB A12B . 


■ AimB ' 


, vec(A) := 


■ A*l 


_ AnlB An2B . 


AnmB 




^*?77 



where A^j denotes the j column of A. 

Kronecker product and vec operator are linked by the well-known property (e.^., Bernstein, 
2005, proposition 7.1.9): 



Yec{ABC) = (C^0 A)vec(S). 



(1) 



Another well-known property of the Kronecker product, which we use in Section 5, is 
(Bernstein, 2005, proposition 7.1.6): 



{A (g) B){C (^ D) = AC (^ BD. 



(2) 



A closely related concept is that of the Kronecker sum which is defined for real matrices 
A e R"^"* and B e RP^« as 



A(BB := A0l„g + l 



iS, 



(3) 



with Inm {resp. Ipq) denoting the nxm (resp. px q) identity matrix. Many of its properties 
can be derived from those of the Kronecker product. 

Finally, the Hadamard product of two real matrices A, S G R^^^, denoted hj AQ B ^ 
j^nxm^ is obtained by element- wise multiplication. It interacts with the Kronecker product 
via 



{A0 B) Q {C D) = {AQ C) ® {B Q D). 



(4) 



All the above concepts can be extended to a Reproducing Kernel Hilbert Space (RKHS) 
(See Appendix A for details). 



1.3 Graph Concepts 

A graph G consists of an ordered set of n vertices V — {i^i, 1^2, • • • , '^n}, and a set of edges 
E dV xV . A vertex vi is said to be a neighbor of another vertex Vj if they are connected 
by an edge, i.e., if (vi^Vj) G £"; this is also denoted Vi ~ Vj. A walk of length t on G 
is a sequence of indices zi, Z25 • • • ^t+i such that Vi^^ ~ ^ik+i ^^^ all 1 < A: < t. A graph 
is said to be connected if any two pairs of vertices can be connected by a walk. In this 
paper we will always work with connected graphs. A graph is said to be undirected if 
(vi.Vj) G E ^^ (vj.Vi) G E. 

In much of the following we will be dealing with weighted graphs, which are a slight 
generalization of the above. In a weighted graph, each edge (vi^Vj) has an associated 
weight Wij > signifying its "strength". If Vi and Vj are not neighbors, then Wij = 0. In 
an undirected weighted graph Wij = Wji. 
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The adjacency matrix of an unweighted graph is an n x n matrix A with A^j = 1 if 
Vi r^ Vj, and otherwise. The adjacency matrix of a weighted graph is just the matrix 
of weights, Aij = Wij. In both cases, if G is undirected, then the adjacency matrix is 
symmetric. The diagonal entries of A are always zero. 

The adjacency matrix has a normalized cousin, defined A := D~^ A, which has the 
property that each of its rows sums to one, therefore it can serve as the transition matrix 
for a stochastic process. Here, D is a, diagonal matrix of node degrees, d^, such that 
Da — di — J2j^ij' ^ random walk on G is a process generating sequences of vertices 
Vi^^Vi^^Vi^^ . . . according to F(ij^^i\ii^ . . .ik) = ^ifc,ifc+i, that is, the probability at Vij^ of 
picking Vif^^-^ next is proportional to the weight of the edge {vij^^Vij^^-^)- The t^^ power of 
A thus describes t-length walks, i.e., [A^]ij is the probability of a transition from vertex Vi 
to vertex Vj via a walk of length t. If po is an initial probability distribution over vertices, 
the probability distribution pt describing the location of our random walker at time t is 
Pt = A^pq. The j^^ component of pt denotes the probability of finishing a t-length walk at 
vertex Vj. We will use this intuition to define generalized random walk graph kernels. 

Let A' be a set of labels which includes the special label (. Every edge-labeled graph G 
is associated with a label matrix X G X^^^ such that Xij — C,\S. (vi, Vj) ^ £", in other words 
only those edges which are present in the graph get a non-(^ label. Let H be the RKHS 
endowed with the kernel k : X x X ^ M., and let (/): X ^ TC denote the corresponding 
feature map which maps ( to the zero element of TC. We use $(X) to denote the feature 
matrix of G (see Appendix A for details). For ease of exposition we do not consider labels 
on vertices here, though our results hold for that case as well. Henceforth we use the term 
labeled graph to denote an edge-labeled graph. 

1.4 Paper Outline 

In the first part of this paper (Sections 2-4) we present a unifying framework for graph 
kernels, encompassing many known kernels as special cases, and connecting to others. We 
describe our framework in Section 2, prove that it leads to p.s.d. kernels, and discuss 
random walk graph kernels, geometric kernels on graphs, and marginalized graph kernels 
as special cases. For ease of exposition we will work with real matrices in the main body 
of the paper and relegate the RKHS extensions to Appendix A. In Section 3 we present 
three efficient ways to compute random walk graph kernels, namely 1. via reduction to a 
Sylvester equation, 2. using a conjugate gradient (CG) solver, and 3. using a fixed point 
iteration. Experiments on a variety of real and synthetic datasets in Section 4 illustrate 
the computational advantages of our approach, which reduces the time complexity of kernel 
computations from 0{nP) to O(n^). 

In the second part (Sections 5-7) we draw further connections to existing kernels on 
structured objects. In Section 5 we present a simple proof that rational kernels are p.s.d., 
and show that specializing them to graphs yields random walk graph kernels. In Section 6 we 
discuss the relation between R-convolution kernels and various incarnations of graph kernels. 
In fact, all known graph kernels can be shown to be instances of R-convolution kernels. We 
also show that extending the framework therough the use of semirings does not always result 
in a p.s.d. kernel; a case in point is the optimal assignment kernel of Frohlich et al. (2006). 
In Section 7 we shift our attention to diffusion processes on graphs and associated kernels; 
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this leads us to propose new kernels on graphs, based on the Cartesian graph product. We 
show that the efficient computation techniques we introduced in Section 3 are also applicable 
here, but are ultimately forced to conclude that diffusion-based graph kernels are not useful 
in a general context. We conclude in Section 8 with an outlook and discussion. 

2. Random Walk Graph Kernels 

Our generalized random walk graph kernels are based on a simple idea: given a pair of 
graphs, perform random walks on both, and count the number of matching walks. We show 
that this simple concept underlies random walk graph kernels, marginalized graph kernels, 
and geometric kernels on graphs. In order to do this, we first need to introduce direct 
product graphs. 

2.1 Direct Product Graphs 

Given two graphs G{V, E) and G'{V' ^ £'')(with \V\ — n and \V'\ — n')^ their direct product 
Gx is a graph with vertex set 



and edge set 



V^={{vi,v[,):vieV,v[,eV'}, (5) 



^x = {{{vi,v[,), {vj,v'y)) : ivi,Vj)eEA{vl,,v'.,)eE'}. (6) 



In other words, Gx is a graph over pairs of vertices from G and G\ and two vertices in Gx 
are neighbors if and only if the corresponding vertices in G and G^ are both neighbors (see 
Figure 1 for an illustration). If A and A are the respective adjacency matrices of G and 
G\ then the adjacency matrix ofGx is Ax =A®A. Similarly, Ax = A® A\ 

Performing a random walk on the direct product graph is equivalent to performing a 
simultaneous random walk on G and G^ (Imrich and Klavzar, 2000). li p and p^ denote 
initial probability distributions over the vertices of G and G\ then the corresponding initial 
probability distribution on the direct product graph is px := p®p' . Likewise, if q and q' are 
stopping probabilities (that is, the probability that a random walk ends at a given vertex), 
then the stopping probability on the direct product graph is qx '.— q® q' . 

If G and G' are edge-labeled, we can associate a weight matrix Wx ^ K^^ ^^^ with 
Gx using our Kronecker product in RKHS (Definition 12): Wx = ^{^) ® ^{X'). As a 
consequence of the definition of $(X) and $(X^), the entries of Wx are non-zero only if 
the corresponding edge exists in the direct product graph. The weight matrix is closely 
related to the normalized adjacency matrix: assume that 7"^ = R endowed with the usual 
inner product, and (f){Xij) = 1/di if (vi^Vj) ^ E ot zero otherwise. Then $(X) = A and 
$(X') = A\ and consequently Wx = Ax, that is, the weight matrix is identical to the 
normalized adjacency matrix of the direct product graph. 

To extend the above discussion, assume that 7-^ = M^ endowed with the usual inner 
product, and that there are d distinct edge labels {1, 2, . . . , d}. For each edge {vj, Vj) ^ E 
we have (j){Xij) — e/ /di if the edge {vi^ Vj) is labeled /. All other entries of ^{X) are set to 
0. Hi is therefore a delta kernel, that is, its value between any two edges is one iff the labels 
on the edges match, and zero otherwise. The weight matrix Wx has a non-zero entry iff an 
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11' 



21' 



31' 




Figure 1: Two graphs (top left &; right) and their direct product (bottom). Each node 
of the direct product graph is labeled with a pair of nodes; an edge exists in 
the direct product if and only if the corresponding nodes are adjacent in both 
original graphs. For instance, nodes iV and 32^ are adjacent because there is an 
edge between nodes 1 and 3 in the first, and V and 2^ in the second graph. 



edge exists in the direct product graph and the corresponding edges in G and G^ have the 
same label. Let M denote the normalized adjacency matrix of the graph filtered by the label 
/, that is, ^Aij = Aij if Xij = / and zero otherwise. Some simple algebra (omitted for the 
sake of brevity) shows that the weight matrix of the direct product graph can be written as 



Wy, =J2^A0^A'. 



(7) 



^=1 



We will show in the sequel that kernels defined using the above weight matrix can be 
computed efficiently. 



2.2 Kernel Definition 

As stated above, performing a random walk on the direct product graph Gx is equivalent 
to performing a simultaneous random walk on the graphs G and G^ (Imrich and Klavzar, 
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2000). Therefore, the ((z — l)n + j, {%' — l)n' -\- j'Y^ entry of A\ represents the probabihty 
of simultaneous k length random walks on G (starting from vertex Vi and ending in vertex 
Vj) and G' (starting from vertex v[f and ending in vertex v^-f). The entries of Wx represent 
similarity between edges. The ((z — l)n + j, {%' — l)n' + j'Y^ entry of W^ represents the 
similarity between simultaneous k length random walks on G and G' measured via the 
kernel function n. 

Given the weight matrix Wx , initial and stopping probability distributions px and q^ , 
and an appropriately chosen discrete measure /x, we can define a kernel on G and G' as 

oo 

fc(G,G'):=J^Mfc)'zIW^xPx. (8) 

k=l 

In order to show that (8) is a valid p.s.d. kernel we need the following technical lemma: 
Lemma 2 V A: G N : W^px = vec[$(XOV {^XfpY]. 
Proof By induction over k. Base case: A: = 1. Observe that 

Px = {p®p)Yec{l) = Yec{pp^). (9) 

By using Lemma 13, WxPx can be written as 

[$(X) $(X0] vec(yp^) = Yec[^{X')p'p~^^{Xy] 

= Yec[^{X')p\^{X)py]. (10) 

Induction from k to fc+l: Using the induction assumption W^px = vec[$(X^)^y {^[X^pY] 
and Lemma 13 we obtain 

W^+Vx = WxW^Px = ($(X)®$(X'))vec[$(X')V(*(^)V)^] 

= vec[$(X')$(X') V {HXfp)^^{X)^] (11) 

= vec[$(X')'^+V ($(X)'=+^p)T]. 



Lemma 3 // the measure ii{k) is such that (8) converges, then it defines a valid p.s.d. 
kernel. 

Proof Using Lemmas 13 and 2 we can write 

= vec[g'^$(X') V {^{XfpYq] 

= (g^$(X)V)^ (g'^$(X') V) • (12) 

^(0^ p{G') 

Each individual term of (8) equals piG)^ p{G') for some function p, and is therefore a valid 
p.s.d. kernel. The lemma follows because the class of p.s.d. kernels is closed under convex 
combinations (Berg et al., 1984). ■ 
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2.3 Special Cases 

A popular choice to ensure convergence of (8) is to assume //(/c) = A^ for some A > 0. If A 
is sufficiently small^ then (8) is well-defined, and we can write 

fc(G,G0 = 5^A^g;^H-,^Px =g;^(I-AH-x)-V. (13) 

k 

Kashima et al. (2004) use marginalization and probabilities of random walks to define 
kernels on graphs. Given transition probability matrices P and P' associated with graphs 
G and G' respectively, their kernel can be written as (Kashima et al., 2004, Eq. 1.19) 

fc(G,GO = g;^(I-Tx)-V, (14) 

where Tx := [vec(P) vec(PO^] © [^(^) ® H^')]- The edge kernel k{Xij,X'.,.,) := 
PijP[,-,i^{Xij,X'..,) with A=l recovers (13). 

Gartner et al. (2003), on the other hand, use the adjacency matrix of the direct product 
graph to define the so-called random walk graph kernel 

n n' oo 

fc(G,G') = EEE^'[^xk.- (15) 

i=l j = l k=l 

To recover their kernel in our framework, assume an uniform distribution over the vertices of 
G and G\ that is, set p = q = 1/n and pJ — cl — 1/n' . The initial as well as final probability 
distribution over vertices of Gx is given by Px — (Ix — e/(nn^). Setting $(X) := A, 
$(X') = A\ and Wx = ^x, we can rewrite (8) to obtain 

oo ^ n n' oo 

k{G, G') = y: x'qw^p. = ;^v^ e e e ^'[^x].i, 

k=l 1=1 3=1 k=l 

which recovers (15) to within a constant factor. 

Finally, the so-called geometric kernel is defined as (Gartner, 2002) 

k{G,G') = EE[e^^x]^, = e^e^^xe, (16) 

1=1 j=i 

which can be recovered in our setting by setting p — q — 1/n, p' — q' — 1/n' ^ ^{L) := A, 
$(L0 = A^ and/i(A:) = X^/k\. 

3. Efficient Computation 

Computing the kernels of Gartner et al. (2003) and Kashima et al. (2004) essentially boil 
down to inverting the matrix (I —XWx)- If both G and G' have n vertices, then (I —XWx) 
is an n^ x n? matrix. Given that the complexity of inverting a matrix is cubic in its 
dimensions, a direct computation of (13) would require 0{n^) time. In the first part of 
this section we show that iterative methods, including those based on Sylvester equations, 
conjugate gradients, and fixed-point iterations, can be used to speed up this computation. 
Later, in Section 3.4, we show that the geometric kernel can be computed in O(n^) time. 

1 . The values of A which ensure convergence depend on the spectrum of Wx • See Chapter 6 of Vishwanathan 
(2002) for a discussion of this issue. 
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3.1 Sylvester Equation Methods 

Consider the following equation, commonly known as the Sylvester or Lyapunov equation: 

M = SMT + Mo. (17) 

Here, S, T, Mq G R^^^ are given and we need for solve for M G R^^^. These equations can 
be readily solved in O(n^) time with freely available code (Gardiner et al., 1992), such as 
Matlab's dlyap method. Solving the generalized Sylvester equation 

d 
M = ^5,MT, + Mo (18) 

i=l 

involves computing generalized simultaneous Schur factorizations of d symmetric matrices 
(Lathauwer et al., 2004). Although technically involved, this can also be solved efficiently, 
albeit at a higher computational cost. 

We now show that if the weight matrix Wx can be written as (7) then the problem of 
computing the graph kernel (13) can be reduced to the problem of solving the following 
Sylvester equation: 

d 
M = ^ A k' M k^ + Mo, (19) 

i=l 

where vec(Mo) = px- We begin by flattening (19): 

d 
vec(M) = A^vec(k'Mk^) +px. (20) 

i=l 

Using Lemma 13 we can rewrite (20) as 

d 
(I - A ^ k W) vec(M) = Px , (21) 

i=l 

use (7), and solve (21) for vec(M): 

vec(M) = (I-AW^x)-'px. (22) 

Multiplying both sides of (22) by q^ yields 

gJvec(M) = ql{I-XW^)-^p^. (23) 

The right-hand side of (23) is the graph kernel (13). Given the solution M of the 
Sylvester equation (19), the graph kernel can be obtained as gJvec(M) in O(n^) time. 
Since solving the Sylvester equation takes O(n^) time, computing the random walk graph 
kernel in this fashion is significantly faster than the O(n^) time required by the direct 
approach. 

Solving the generalized Sylvester equation requires computing generalized simultaneous 
Schur factorizations of d symmetric matrices, where d is the number of labels. If d is large. 
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the computational cost may be reduced further by computing matrices S and T such that 
Wx ^ S ® T. We then simply solve the simple Sylvester equation (17) involving these 
matrices. Finding the nearest Kronecker product approximating a matrix such as Wx is 
a well-studied problem in numerical linear algebra, and efficient algorithms which exploit 
sparsity of Wx are readily available (Van Loan, 2000). Formally, these methods minimize 
the Frobenius norm \\Wx — S ^T\\f by computing the largest singular value of a permuted 
version of Wx- In general this takes 0{n'^) time for an n^ by n^ matrix, but can be done 
in 0{din?) here since Wx is a sum of Kronecker products. Sparsity of Wx can then be 
exploited to speed this computation further. 

3.2 Conjugate Gradient Methods 

Given a matrix M and a vector 6, conjugate gradient (CG) methods solve the system 
of equations Mx — b efficiently (Nocedal and Wright, 1999). While they are designed 
for symmetric p.s.d. matrices, CG solvers can also be used to solve other linear systems 
efficiently. They are particularly efficient if the matrix is rank deficient, or has a small 
effective rank, that is, number of distinct eigenvalues. Furthermore, if computing matrix- 
vector products is cheap — because M is sparse, for instance — the CG solver can be sped 
up significantly (Nocedal and Wright, 1999). Specifically, if computing Mr for an arbitrary 
vector r requires 0{k) time, and the effective rank of the matrix is m, then a CG solver 
requires only 0{mk) time to solve Mx — b. 

The graph kernel (13) can be computed by a two-step procedure: First we solve the 
linear system 

(I-AWx)x = px, (24) 

for X, then we compute q^x. We now focus on efficient ways to solve (24) with a CG solver. 
Recall that if G and G^ contain n vertices each then Wx is a n^ x n? matrix. Directly 
computing the matrix- vector product WxT, requires 0{n^) time. Key to our speed-ups is 
the ability to exploit Lemma 13 to compute this matrix-vector product more efficiently: 
Recall that Wx = ^{X) $(X'). Letting r = vec(i?), we can use Lemma 13 to write 

Wxr = ($(X)0$(XO)vec(i?) = vec($(XOi?$(X)^). (25) 

If (/){•) G M^ then the above matrix-vector product can be computed in 0{n^d) time. If $(X) 
and ^{X') are sparse, however, then $(X')i?$(X)^ can be computed yet more efficiently: 
if there are 0{n) non-^" entries in $(X) and $(X^), then computing (25) requires only O(n^) 
time. 

3.3 Fixed-Point Iterations 

Fixed-point methods begin by rewriting (24) as 

x^Px +\WxX. (26) 

Now, solving for x is equivalent to finding a fixed point of the above iteration (Nocedal 
and Wright, 1999). Letting xt denote the value of x at iteration t, we set xq := Px, then 

10 



Graph Kernels 

compute 

a^t+i=Px ^\W^xt (27) 

repeatedly until ||x^+i — xt\\ < e^ where || • || denotes the Euclidean norm and e some pre- 
defined tolerance. This is guaranteed to converge if all eigenvalues of XW^ lie inside the 
unit disk; this can be ensured by setting A < l/^max, where ^max is the largest-magnitude 
eigenvalue of Wx • 

The above is closely related to the power method used to compute the largest eigenvalue 
of a matrix (Golub and Van Loan, 1996); efficient preconditioners can also be used to 
speed up convergence (Golub and Van Loan, 1996). Since each iteration of (27) involves 
computation of the matrix- vector product WxXt, all speed-ups for computing the matrix- 
vector product discussed in Section 3.2 are applicable here. In particular, we exploit the 
fact that Wx is a sum of Kronecker products to reduce the worst-case time complexity to 
O(n^) per iteration in our experiments, in contrast to Kashima et al. (2004) who computed 
the matrix- vector product explicitly. 

3.4 Geometric Kernel 

We now turn our attention to the geometric kernel, (16). If both G and G^ have n vertices 
then Ax is a, n? x n? matrix, and therefore a naive implementation of the geometric kernel 
requires 0{nP) time. The following lemma shows that this can be reduced to O(n^). 

Lemma 4 If G and G' have n vertices then the geometric kernel, (16)^ can he computed in 
0{v?) time. 

Proof Let A — PDP^ denote the spectral decomposition of A, that is, columns of P are 
the eigenvectors of A and I? is a diagonal matrix of corresponding eigenvalues (Stewart, 
2000). Similarly A' — P'D'P'^ . The spectral decomposition of a n x n matrix can be 
computed efficiently in O(n^) time (Golub and Van Loan, 1996). 

Using Propositions 7.1.10, 7.1.6, and 7.1.3 of Bernstein (2005) it is easy to show that 
the spectral decomposition of Ax is {P ® P'){D D'){P ® P')^ . Furthermore, the matrix 
exponential exp(AAx) can be written as {P®P') exp{XD D'){P P'^ (Bernstein, 2005, 
proposition 11.2.3). This and (2) allow us to rewrite (16) as 

A:(G,GO = {e®ey(P®P')exp{XD®D'){P®Py{e®e) (28) 

= (e^P e V) exp( Ai:> D^) (P^e P'^e) . (29) 

The proof follows by observing that each of the three terms in the above equation as well 
as their product can be computed in 0{n?) time. ■ 



4. Experiments 

Numerous other studies have applied random walk graph kernels to applications like protein 
function prediction (Borgwardt et al., 2005) and chemoinformatics (Kashima et al., 2004). 

11 
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Therefore we concentrate on runtime comparisons in our experimental evaluation. We 
present three sets of experiments. First, we work with randomly generated graphs and 
study the scaling behaviour of our algorithms. Second, we assess the practical impact of 
our algorithmic improvement by comparing the time taken to compute graph kernels on four 
real-world datasets whose size mandates fast kernel computation. Third, we devise novel 
methods for protein interaction network comparison using graph kernels. The algorithmic 
challenge here is to efficiently compute kernels on large sparse graphs. 

For all our experiments our baseline comparator is the direct approach of Gartner et al. 
(2003). Our code was written in Matlab Release 14, and experiments run under Suse Linux 
on a 2.6 GHz Intel Pentium 4 PC with 2 GB of main memory. We employed Lemma 13 
to speed up matrix- vector multiplication for both CG and fixed-point methods (c/. Sec- 
tion 3.2), and used the function dlyap from the control toolbox of Matlab to solve the 
Sylvester equation. By default, we used a value of A = 0.001, and set the convergence 
tolerance for both CG solver and fixed-point iteration to 10~^. The value of A was chosen 
to ensure that the random walk graph kernel converges. Since our methods are exact and 
produce the same kernel values (to numerical precision), where applicable, we only report 
the CPU time each algorithm takes. 

4.1 Randomly Generated Graphs 

The aim here is to study the scaling behaviour of our algorithms on graphs of different sizes 
and different node degrees. We generated two sets of graphs: for the first set, SET-1, we 
begin with an empty graph of size 2^, A: = 1, 2, . . . , 10, and randomly insert edges until a) 
the graph is fully connected, and b) the average degree of each node is at least 2. For each k 
we repeat the process 10 times and generate 10 such graphs. The time required to compute 
the 10 X 10 kernel matrix for each value of k is depicted in Figure 2 (left). As expected, 
the direct approach scales as O(n^), solving the Sylvester equation (SYL) as O(n^), while 
the conjugate gradient (CG) and fixed point iteration (FP) approaches scale sub-cubically. 
Furthermore, note that the direct approach could not handle graphs of size greater than 
128 = 2^ even after two days of computation. 

We also examined the impact of Lemma 13 on enhancing the runtime performance of 
the fixed-point iteration approach as originally proposed by Kashima et al. (2004). For 
this experiment, we again use graphs from SET-1 and computed the 10 x 10 kernel matrix, 
once using the original fixed-point iteration, and once using fixed-point iteration enhanced 
by Lemma 13. Results are illustrated in Figure 2 (right). Our approach is often 10 times 
or more faster than the original fixed-point iteration, especially on larger graphs. 

The second set of randomly generated graphs is called SET-2. Here, we fixed the size of 
the graph at 2^ = 32 (the largest size that the direct method could handle comfortably), and 
randomly inserted edges until a) the graph is fully connected, and b) the average number 
of non-zero entries in the adjacency matrix is at least x%, where x = 10, 20, . . . , 100. For 
each X, we generate 10 such graphs and compute the 10 x 10 kernel matrix. Our results are 
shown in the left panel of Figure 3. On these small graphs the runtimes of all the methods, 
including the direct method, is seen to be fairly independent of the filling degree. This is 
not surprising since the direct method has to explicitly compute the inverse matrix; the 
inverse of a sparse matrix need not be sparse. 
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Runtime vs. graph size 
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Figure 2: Time to compute a 10 x 10 kernel matrix on SET-1 plotted as a function of the size 
of graphs (# nodes). Left: We compare the Sylvester equation (SYL), conjugate 
gradient (CG), and fixed point iteration (FP) approaches to the direct method. 
The dashed thin red line indicates 0{nP) scaling, while the dashed thin black line 
indicates 0{n^) scaling. Right: We compare the runtime of the original fixed- 
point iteration (original) with the fixed-point iteration enhanced with Lemma 13 
(vec-trick) . 



In order to investigate further the behavior of our speedups on large graphs we generated 
a new random set of graphs, by using the same procedure as for SET-2, but with the graph 
size to 1024. The direct method is infeasible for such large graphs. We plot the runtimes of 
computing the 10 x 10 kernel matrix in the right panel of Figure 3. On these large graphs 
the runtimes of the Sylvester equation solver are fairly independent of the filling degree. 
This is because the Sylvester equation solver is not able to exploit sparsity in the adjacency 
matrices. On the other hand, the runtimes of both the conjugate gradient solver as well as 
the fixed point iteration increase with the filling degree. Especially for very sparse graphs 
(filling degree of less than 20%) these methods are seen to be extremely efficient. 

4.2 Real- World Datasets 

In our next experiment we use four real-world datasets: Two sets of molecular compounds 
(MUTAG and PTC), and two datasets describing protein tertiary structure (Protein and 
Enzyme). Graph kernels provide useful measures of similarity for all of these. We now 
briefly describe each dataset, and discuss how graph kernels are applicable. 

Chemical Molecules Toxicity of chemical molecules can be predicted to some degree 
by comparing their three-dimensional structure. We employed graph kernels to measure 
similarity between molecules from the MUTAG and PTC datasets (Toivonen et al., 2003). 
The average number of nodes per graph in these sets is 17.72 resp. 26.70; the average number 
of edges is 38.76 resp. 52.06. 
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Figure 3: Time to compute a 10 x 10 kernel matrix on SET-1 plotted as a function of 
the filling degree of the graph. Left: We compare the Sylvester equation (SYL), 
conjugate gradient (CG), and fixed point iteration (FP) approaches to the direct 
method on graphs containing 32 nodes. Right: We compare SYL, CG, and FP 
approaches on larger graphs with 1024 nodes. The direct method is infeasible on 
these graphs. 



Large Protein Graph Dataset A standard approach to protein function prediction 
entails classifying proteins into enzymes and non-enzymes, then further assigning enzymes 
to one of the six top-level classes of the Enzyme Commission (EC) hierarchy. Towards 
this end, Borgwardt et al. (2005) modeled a dataset of 1128 proteins as graphs in which 
vertices represent secondary structure elements, and edges represent neighborhood within 
the 3-D structure or along the amino acid chain. Comparing these graphs via a modified 
random walk graph kernel and classifying them via a Support Vector Machine (SVM) led 
to function prediction accuracies competitive with state-of-the-art approaches (Borgwardt 
et al., 2005). We used Borgwardt et al.'s (2005) data to test the efficacy of our methods 
on a large dataset. The average number of nodes and edges per graph in this data is 38.57 
res^. 143.75. We used a single label on the edges, and the delta kernel to define similarity 
between edges. 

Large Enzyme Graph Dataset We repeated the above experiment on an enzyme graph 
dataset, also due to Borgwardt et al. (2005). This dataset contains 600 graphs, with 32.63 
nodes and 124.27 edges on average. Graphs in this dataset represent enzymes from the 
BRENDA enzyme database (Schomburg et al., 2004). The biological challenge on this data 
is to correctly assign the enzymes to one of the EC top-level classes. 

4.2.1 Unlabeled Graphs 

For this experiment, we computed kernels taking into account only the topology of the 
graph, i.e.; we did not consider node or edge labels. Table 1 lists the CPU time required to 
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Table 1: Time to compute kernel matrix for unlabeled graphs from various datasets. 



dataset 


MUTAG 


PTC 


Enzyme 


Protein 


nodes/graph 


17.7 


26.7 


32.6 


38.6 


edges/node 


2.2 


1.9 


3.8 


3.7 


#graphs 


100 


230 


100 


417 


100 


600 


100 


1128 


Direct 


18'09" 


104'31" 


142'53" 


41h* 


31h* 


46. 5d* 


36d* 


12.5y* 


Sylvester 


25.9" 


2'16" 


73.8" 


19'30" 


48.3" 


36'43" 


69'15" 


6.1d* 


Conjugate 


42.1" 


4'04" 


58.4" 


19'27" 


44.6" 


34'58" 


55.3" 


97'13" 


Fixed-Point 


12.3" 


1'09" 


32.4" 


5'59" 


13.6" 


15'23" 


31.1" 


40'58" 



*: Extrapolated; run did not finish in time available. 

compute the full kernel matrix for each dataset, as well as — for comparison purposes — a 
100 X 100 sub-matrix. The latter is also shown graphically in Figure 4 (left). 

On these unlabeled graphs, conjugate gradient and fixed-point iteration — sped up via 
our Lemma 13 — are consistently about two orders of magnitude faster than the conventional 
direct method. The Sylvester equation approach is very competitive on smaller graphs 
(outperforming CG on MUTAG) but slows down with increasing number of nodes per 
graph; this is because we were unable to incorporate Lemma 13 into Mat lab's black-box 
dlyap solver. Even so, the Sylvester equation approach still greatly outperforms the direct 
method. 

4.2.2 Labeled Graphs 

For this experiment, we compared graphs with edge labels. Note that node labels can 
be dealt with by concatenating them to the edge labels of adjacent edges. On the two 
protein datasets we employed a linear kernel to measure similarity between edge labels 
representing distances (in Angstroms) between secondary structure elements. On the two 
chemical datasets we used a delta kernel to compare edge labels reflecting types of bonds in 
molecules. We report CPU times for the full kernel matrix as well as a 100 x 100 sub-matrix 
in Table 2; the latter is also shown graphically in Figure 4 (right). 

On labeled graphs, our three methods outperform the direct approach by about a factor 
of 1000 when using the linear kernel. In the experiments with the delta kernel, conjugate 
gradient and fixed-point iteration are still at least two orders of magnitude faster. Since 
we did not have access to a generalized Sylvester equation (18) solver, we had to use a 
Kronecker product approximation (Van Loan, 2000) which dramatically slowed down the 
Sylvester equation approach for the delta kernel. 

4.3 Protein Interaction Networks 

In our third experiment, we used random walk graph kernels to tackle a large-scale problem 
in bioinformatics involving the comparison of fairly large protein-protein interaction (PPI) 
networks. Using a combination of human PPI and clinical microarray gene expression data. 
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Figure 4: Time (in seconds on a log-scale) to compute 100x100 kernel matrix for unlabeled 
(left) resp. labelled (right) graphs from several datasets, comparing the conven- 
tional direct method to our fast Sylvester equation, conjugate gradient (CG), and 
fixed-point iteration (FP) approaches. 



the task is to predict the disease outcome (dead or alive, relapse or no relapse) of cancer 
patients. As before, we set A = 0.001 and the convergence tolerance to 10~^ for all our 
experiments reported below. 

4.3.1 Co-Integration of Gene Expression and PPI Data 

We co-integrated clinical microarray gene expression data for cancer patients with known 
human PPI due to Rual et al. (2005). Specifically, a patient's gene expression profile 
was transformed into a graph as follows: A node was created for every protein which — 
according to Rual et al. (2005) — participates in an interaction, and whose corresponding 
gene expression level was measured on this patient's microarray. We connect two proteins 
in this graph by an edge if Rual et al. (2005) list these proteins as interacting, and both 



Table 2: Time to compute kernel matrix for labeled graphs from various datasets. 



kernel 


delta 


linear 


dataset 


MUTAG 


PTC 


Enzyme 


Protein 


#graphs 


100 


230 


100 


417 


100 


600 


100 


1128 


Direct 


7.2h 


1.6d* 


1.4d* 


25d* 


2.4d* 


86d* 


5.3d* 


18y* 


Sylvester 


3.9d* 


21d* 


2.7d* 


46d* 


89.8" 


53'55" 


25'24" 


2.3d* 


Conjugate 


2'35" 


13'46" 


3'20" 


53'31" 


124.4" 


71 '28" 


3'01" 


4.1h 


Fixed-Point 


1'05" 


6'09" 


1'31" 


26'52" 


50.1" 


35'24" 


1'47" 


1.9h 



*: Extrapolated; run did not finish in time available. 
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genes are up- resp. downregulated with respect to a reference measurement. Each node 
bears the name of the corresponding protein as its label. 

This approach of co-integrating PPI and gene expression data is built on the assumption 
that genes with similar gene expression levels are translated into proteins that are more likely 
to interact. Recent studies confirm that this assumption holds significantly more often for 
co-expressed than for random pairs of proteins (Eraser et al., 2004; Bhardwaj and Lu, 
2005). To measure similarity between these networks in a biologically meaningful manner, 
we compare which groups of proteins interact and are co-regulated in each patient. For this 
purpose, a random walk graph kernel is the natural choice of kernel, as a random walk in 
this graph represents a group of proteins, in which consecutive proteins along the walk are 
co-expressed and interact. As each node bears the name of its corresponding protein as its 
node label, the size of the product graph is at most that of the smaller of the two input 
graphs. 

4.3.2 Composite Graph Kernel 

The presence of an edge in a graph signifies an interaction between the corresponding nodes. 
In chemoinformatics, for instance, edges indicate chemical bonds between two atoms; in PPI 
networks, edges indicate interactions between proteins. When studying protein interactions 
in disease, however, the absence of a given interaction can be as significant as its presence. 
Since existing graph kernels cannot take this into account, we propose to modify them 
appropriately. Key to our approach is the notion of a complement graph: 

Definition 5 Let G — (y, E) he a graph with vertex set V and edge set E. Its complement 
G — (y, E) is a graph over the same vertices but with complementary edges E := (VxV)\E. 

In other words, the complement graph consists of exactly those edges not present in the 
original graph. Using this notion we define the composite graph kernel 

kcomp{G, GO := k{G, G') + k(G, G^). (30) 

This deceptively simple kernel leads to substantial gains in performance in our experiments 
comparing co-integrated gene expression/protein interaction networks. 

4.3.3 Datasets 

Leukemia. Bullinger et al. (2004) provide a dataset of microarrays of 119 leukemia pa- 
tients. Since 50 patients survived after a median follow-up time of 334 days, always pre- 
dicting a lethal outcome here would result in a baseline prediction accuracy of 1 - 50/119 
= 57.98%. Co-integrating this data with human PPI, we found 2, 167 proteins from Rual 
et al. (2005) for which Bullinger et al. (2004) report expression levels among the 26, 260 
genes they examined. 

Breast Cancer. This dataset consists of microarrays of 78 breast cancer patients, of 
which 44 had shown no relapse of metastases within 5 years after initial treatment (van't 
Veer et al., 2002). Always predicting survival thus gives a baseline prediction accuracy 
of 44/78 = 56.41% on this data. When generating co-integrated graphs, we found 2,429 
proteins from Rual et al. (2005) for which van't Veer et al. (2002) measure gene expression 
out of the 24, 479 genes they studied. 
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Table 3: Average time to compute kernel matrix on protein interaction networks. 



dataset 


Leukemia 


Breast Cancer 


kernel 


vanilla 


composite 


vanilla 


composite 


Direct 


2hl5'23" 


5h 12'29" 


41i01'16" 


8h 24'45" 


Sylvester 


12'03" 


25'41" 


20'21" 


45'5r' 


Conjugate 


6" 


13" 


13" 


28" 


Fixed-Point 


4" 


7" 


8" 


17" 



4.3.4 Results 

The CPU runtimes of our CG, fixed-point, and Sylvester equation approaches to graph 
kernel computation on the cancer patients modeled as graphs is contrasted with that of 
the direct approach in Table 3. Using the computed kernel and a support vector machine 
(SVM) we tried to predict the survivors, either with a "vanilla" graph kernel (13), or our 
composite graph kernel (30) in 10-fold cross-validation. 

On both datasets, our approaches to fast graph kernel computation convey up to three 
orders of magnitude gain in speed. With respect to prediction accuracy, the vanilla ran- 
dom walk graph kernel performs hardly better than the baseline classifer on both tasks 
(Leukemia: 59.17 % vs 57.98 %; Breast Cancer: 56.41 % vs. 56.41 %). The composite 
graph kernel outperforms the vanilla graph kernel in accuracy in both experiments, with 
an increase in prediction accuracy of around 4-5 % (Leukemia: 63.33 %; Breast cancer: 
61.54 %). 

The vanilla kernel suffers from its inability to measure network discrepancies, the paucity 
of the graph model employed, and the fact that only a small minority of genes could be 
mapped to interacting proteins; due to these problems, its accuracy remains close to the 
baseline. The composite kernel, by contrast, also models missing interactions. With it, 
even our simple graph model, that only captures 10% of the genes examined in both stud- 
ies, is able to capture some relevant biological information, which in turn leads to better 
classification accuracy on these challenging datasets (Warnat et al., 2005). 



5. Rational Kernels 

The aim of this section is to establish connections between rational kernels on transducers 
(Cortes et al., 2004) and random walk graph kernels. In particular, we show that compo- 
sition of transducers is analogous to computing product graphs, and that rational kernels 
on weighted transducers may be viewed as generalizations of random walk graph kernels 
to weighted automata. In order to make these connections explicit we adapt slightly non- 
standard notation for weighted transducers, extensively using matrices and tensors wherever 
possible. 
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5.1 Semiring 

At the most general level, weighted transducers are defined over semirings. In a semiring 
addition and multiplication are generalized to abstract operations © and with the same 
distributive properties: 

Definition 6 (Mohri, 2002) A semiring is a system (K, ©, 0, 0, 1) such that 

1. (K, 0,0) is a commutative monoid with as the identity element for (i.e., for any 
x,y,z G K^ we have x®y ^ K, {x®y)®z = x®(y®z), x0O = O0x = x and 

x®y = y®x); 

2. (K, 0,1) is a monoid with 1 as the identity operator for (i.e., for any x,y,z G K, 

we have xQy ^K, {xQy)Q z — xQ(yQz), and x0l = l0x = x); 

3. distributes over ®, i.e., for any x,y,z ^ K, 

{x®y)Qz = {xQz) ®{y z), (31) 

z Q{x ®y) = {zQx) ®{z y); (32) 

4- is an annihilator for 0; Vx G K^ x = x = 0. 

Thus, a semiring is a ring that may lack negation. (R, +, •, 0, 1) is the familiar semiring 
of real numbers. Other examples include 

Boolean: ({False, True}, V, A, False, True); 

Logarithmic: (RU {—oc, oo}, 0in, +, —oc, 0), where Vx, y G K : x^\^y := ln(e^ + e^); 

Tropical: (R^ U {— oc}, max, +, — oc, 0). 

The (©, 0) operations in some semirings can be mapped into ordinary (+, •) operations by 
applying an appropriate morphism. 

Definition 7 Let (K, 0, 0, 0, 1) he a semiring. A function '0 : K ^ R Z5 a morphism if 

^(x0y)=^(x)+^(y); (33) 

^{xQy) ^^{x)-^{y)- (34) 

^(0) = anrf^(l) = 1. (35) 

In the following, by 'morphism' we will always mean a morphism from a semiring to the real 
numbers. Not all semirings have such morphisms. For instance, the logarithmic semiring 
has a morphism — namely, the exponential function — but the tropical semiring does not 
have one. 

Ordinary linear algebra operations including Kronecker products, matrix addition, matrix- 
vector multiplication, and matrix-matrix multiplication can be carried over to a semiring 
in a straightforward manner. For instance, if A, S G K^^^, and x G K^ then 



[^ ^]i = © ^ij Xj ^ and (36) 



j=i 



[AQB],^j^^A,uQBkj. (37) 



k=i 
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As in Appendix A, we can extend a morphism '0 to matrices (and analogously to vectors) by 
defining [^(A)]^^ := '^(A^j), and [^"■'^(A)]-^^ := tl;~^{Aij). If the semiring has a morphism, 
7/^, then it is easy to see that 



[AQx]^ = 5^^(A,,) -^(x,) = ^-\^{A)^{x)), and (38) 

n 

[AQB],^j -J2^^k' ^kJ = ^-\^{A)^{B)). (39) 

k=i 

5.2 Weighted Transducers 

Loosely speaking, a transducer is a weighted automaton with an input and output alphabet. 
We will work with the following slightly specialized definition^: 

Definition 8 A weighted finite-state transducer T over a semiring (K, 0,0,0,1) is a 5- 
tuple T = (T^^Q^A^p^q), where Tj is a finite input-output alphabet, Q is a finite set of n 
states, p G K^ is a vector of initial weights, q G K^ is a vector of final weights, and A is 
a 4- dimensional tensor in K^^' 1^' '^^ which encodes transitions and their corresponding 
weights. 

For a, 6 G S we will use the shorthand Aab to denote the nxn slice A^ab^ of the transition 
tensor, which represents all valid transitions on input label a emitting the output label b. 
The output weight associated by T to a pair of strings a — aia2 . . .ai and (5 — bib2 - . .bi is 

|ri (a, (3) = q~^ QAa,b,QAa^b2 • • -QKih QP- (40) 

A transducer is said to accept a pair of strings («,/?), if it assigns non-zero output weight 
to them, i.e., |T] (a,/3) ^ 0. A transducer is said to be regulated if the output weight 
associated to any pair of strings is well defined in K. Since we disallow 6 transitions, our 
transducers are always regulated. 

A weighted automaton is a transducer with identical input and output labels. Therefore, 
the transition matrix of a weighted automaton is a 3-dimensional tensor in K^^^l^l^^, A 
graph is a weighted automaton whose input-output alphabet contains exactly one label, 
and therefore it only accepts strings of the form a^ — aa . . .a. The transition matrix of a 
graph (equivalently, its adjacency matrix) is a 2-dimensional tensor in K^^^. If A denotes 
the adjacency matrix of a graph G, then the output weight assigned by G to a^ is 

lG\{a^)^q~^ QAQAQ...QAQP. (41) 

The inverse of T = (S, Q, A^p^ q), denoted by T"^, is obtained by transposing the input 
and output labels of each transition. Formally, T~^ = (S, Q, B,p, q) where Bah — ^ba- 

The composition of two automata T = {T^,Q,A,p,q) and T^ = {T^,Q\A\p\q^) is an 
automaton Tx = ToT^ = {T^,Qx,B,px,qx), where Qx = QxQ\px ^ p®p'^, qx := Q®q\ 



2. We disallow e transitions, and use the same alphabet for both input and output. Furthermore, in a 
departure from tradition, we represent the transition function as a 4-dimensional tensor. 

3. We use to denote the Kronecker product using the semiring operation 0, in order to distinguish it 
from the regular Kronecker product 0. 
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and Bab — ^ce^^ac^ ^cf ^^ particular, composing T with its inverse yields T o T~^ = 
{Ti^QxQ,B,p®p,q®q), where Bab — ©cgs ^ac® ^bc- There exists a general and efficient 
algorithm for composing transducers which takes advantage of the sparseness of the input 
transducers (e.g. Mohri et al., 1996; Pereira and Riley, 1997). Note that the composition 
operation, when specialized to graphs, is equivalent to computing a direct product graph. 

5.3 Kernel Definition 

Given a weighted transducer T and a function ^ : K ^ R, the rational kernel between two 
strings a ^ aia2 • • -ai and /? = 6162 . . . 6/ is defined as (Cortes et al., 2004): 

fc(a,/?) = V(lrl(a,/?)). (42) 

Cortes et al. (2004) show that a generic way to obtain p.s.d. rational kernels is to replace 
T by ToT"^, and let '0 be a semiring morphism. We now present an alternate proof which 
uses properties of the Kronecker product. Since '0 is a semiring morphism, by specializing 
(40) to Tor-\ we can write k{a,[5) = ^ ([[ToT-^]] (a,/3)) as 

^{q^qY^ I 0^aici^^ici J ...* {®^aici®Ab,A ^{p®p), (43) 

which, in turn, can be rewritten using 

"^[^Aac^AbA =5^*(A,,)0*(A,) (44) 



as 



Y, ^('Z)^ ® "^{qV {"^{Aa.c,) ® *(AiCi)) . . . {'^{Aa.c,) ® *(A,c,)) *(p) ® *(p). (45) 

C1C2...Q 

By successively applying (2) we obtain 
fe(a,/3)= Y. (*(g)^*(AicJ...*(A,cJ*(p))(*(g)^*(Aici)...*(Ac,)*(p)), (46) 

^ V '^ V ^ 

C1C2...Q y^^ yz 

p{a) p{(3) 

which shows that each individual term in the summation is a valid p.s.d. kernel. Since p.s.d. 
kernels are closed under addition, k(a^f3) is a valid p.s.d. kernel. 

5.4 Kernels on Weighted Transducers 

Rational kernels on strings can be naturally extended to weighted transducers S and U via 
(Cortes et al., 2004): 

KS. ^) = ^ I © 1^1 («) ITI (a, /?) im (/?) ) , (47) 
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which, in turn, can be rewritten as 



HS,U) = i;l^lSoToUj{a,P)\. (48) 

If V' is a semiring morphism, then 

k{S,U) = J2^{lSoToUj(a,(3)). (49) 

Since p.s.d. kernels are closed under addition, ii il; (IS oT oU} (a, /?)) is a p.s.d. kernel, then 
k{S, U) is also a valid p.s.d. kernel. 

5.5 Recovering Random Walk Graph Kernels 

In order to recover random walk graph kernels we use the standard (R, +, •, 0, 1) ring as our 
semiring, and hence set tp to be the identity function. Note that since we are dealing with 
graphs, the only strings which are assigned non-zero weight are of the form a^ = aa . . .a. 
Finally, we set the transducer T to simply accept all strings of the form a^ with unit weight. 
In this case, the kernel specializes to 

k{G,G') = Y,[[GoG']]{a^). (50) 

Recall that the normalized adjacency matrix oi G o G' is A^ '.— A® A' ^ where A and A' are 
the normalized adjacency matrices of G and G' respectively. By specializing (41) to Go G^ 
we can rewrite (50) as 

A:(G,G0 = 5^gx^'xPx. (51) 

k 

This essentially recovers (8) with the weight matrix set to the adjacency matrix, but, without 
the discrete measure ii{k). 

6. R-convolution Kernels 

Haussler's (1999) R-convolution kernels provide a generic way to construct kernels for dis- 
crete compound objects. Let x ^ X he such an object, and x \— (xi,X2, . . . ^xd) denote a 
decomposition of x, with each Xi ^ Xi. We can define a boolean predicate 

R: X xX ^ {True, False}, (52) 

where X \— XiX . . . x X^ and i?(x, x) is True whenever x is a valid decomposition of x. 
This allows us to consider the set of all vahd decompositions of an object: 

R-^{x) := {x\R{x,x) = True}. (53) 
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Like Haussler (1999) we assume that R ^(x) is countable. We define the R-convolution ^ 
of the kernels /^i, /^25 • • • , ^D with /^^ : A'^ x A'^ ^ R to be 

D 

k{x^x) — Hiii^ Hi2^ • • -^ I^d{x^x) := >. M(^5^0 IT ^i(^^5^i)5 (^4) 

where /x is a finite measure on X x X which ensures that the above sum converges.^ Haussler 
(1999) showed that fc(x, x') is p.s.d. and hence admissible as a kernel (Scholkopf and Smola, 
2002), provided that all the individual ni are. The deliberate vagueness of this setup in 
regard to the nature of the underlying decomposition leads to a rich framework: Many 
different kernels can be obtained by simply changing the decomposition. 

6.1 Graph Kernels as R-Convolutions 

To apply R-convolution kernels to graphs, one decomposes the graph into smaller substruc- 
tures, and builds the kernel based on similarities between those components. Most graph 
kernels are — knowingly or not — based on R-convolutions; they mainly differ in the way 
they decompose the graph for comparison and the similarity measure they use to compare 
the components. 

Random walk graph kernels, as proposed by Gartner et al. (2003), decompose a graph 
into paths and compute a delta kernel between nodes. Borgwardt et al. (2005), on the 
other hand, use a kernel defined on nodes and edges in order to compute similarity between 
random walks. As we saw in Section 2.3, the marginalized graph kernels of Kashima et al. 
(2004) are closely related, if motivated differently. The decomposition corresponding to this 
kernel is the set of all possible label sequences generated by a walk on the graph. Mahe 
et al. (2004) extend this approach in two ways: They enrich the labels via the so-called 
Morgan index, and modify the kernel definition to prevent tottering^ that is, the generation 
of high similarity scores by multiple, similar, small substructures. Both these extensions 
are particularly relevant for chemoinformatics applications. 

Horvath et al. (2004) decompose a graph into cyclic patterns, then count the number of 
common cyclic patterns which occur in both graphs. Their kernel is plagued by computa- 
tional issues; in fact they show that computing the cyclic pattern kernel on a general graph 
is NP-hard. They consequently restrict their attention to practical problem classes where 
the number of simple cycles is bounded. 

Other decompositions of graphs, which are well suited for particular application domains, 
include subtrees (Ramon and Gartner, 2003), shortest paths (Borgwardt and Kriegel, 2005), 
molecular fingerprints based on various types of depth- first searches (Ralaivola et al., 2005), 
and structural elements such as rings, functional groups (Frohlich et al., 2006), and so on. 



4. Haussler (1999) implicitly assumed this sum to be well-defined, and hence did not use a measure /i in 
his definition. 
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6.2 R-Convolutions in Abstract Semirings 

There have been a few attempts to extend the R-convolution kernel (54) to abstract semir- 
ings, by defining: 

D 

k{x, x) := ^ ii{x, x') O i^i{xi, X-). (55) 

The optimal assignment graph kernel of Frohlich et al. (2006) is motivated along these lines, 
using the tropical semiring. It is defined as 

D 

k{x^x^)— max ji{x^x^)y i<ii{xi^x[). (56) 

x^R-^{x) ^ 

x'eR-^{x') '~ 

Unfortunately this kernel is not always p.s.d. (Vert, 2008). Establishing necessary and 
sufficient conditions for (55) to be p.s.d. remains an open problem. 

7. Diffusion-Based Graph Kernels? 

The adjacency matrix and its normalized cousin are not the only n x n matrices associated 
with undirected graphs. Spectral graph theorist instead prefer to use the so-called graph 
Laplacian 



^ij 


= [D- 


- A]ii = < 


-Wij 

di 



if i ~ J 
if i = j 
otherwise 



(57) 



or the normalized Laplacian L — D~^/^LD~^/^. One can extend the concept of a feature 
matrix of a graph, $(X), to the Laplacian: set ^{D) to be a diagonal matrix with diagonal 
entries [^{D)]ii = J2j[^{^)]ij ^^d non-diagonal entries ( (the null label). Now define 
$(L) :=$(£>) -$(X). 

Just as A is related to random walks, L is closely connected to the concept of diffusion. 
In fact, diffusion can be regarded as the continuous time limit of a specific type of random 
walk, in which over each infinitesimal time interval of length e, a particle at node Vi will 
either move to one of its neighbors Vj with probability ewij, or stay at Vi with probability 
1 — e J2ir^j ^ij — ^ — ^di. Setting e = 1/m for some integer m going to infinity, it is easy to 
see that for any finite time interval of length t the transition matrix of this process is 

Kt= lim { I =: exp(tL), (58) 

m^oo y m J 

the matrix exponential of th. The ability of the random walk to stay in place is crucial to 
taking the continuous time limit. 
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7.1 Diffusion Kernels on Graph Vertices 

Note that in contrast to the normahzed adjacency matrix A, the Laplacian is a symmetric 
matrix. Exploiting this property and the fact that (58) is unchanged if we let m = 2m^ and 
now make m^ go to infinity {i.e., we force m to be even), we may equivalently write 



Kt = exp(tL) = lim 



■>(X) 



tL_ 



/--) • (59) 



Since the product of any matrix with its transpose is p.s.d., we conclude that Kt is symmetric 
and p.s.d.^ and thus k{vi, vj) := [Kt]ij is a valid candidate for a kernel between graph vertices 
(Kondor and Lafferty, 2002). 

The justification why Kt should be a good kernel for learning problems comes from 
deeper arguments connecting spectral graph theory and regularization (Smola and Kondor, 
2003). For example, given any function /: T^ ^ R, and letting f = (/('^i), f{v2)i • • • , fi'^n))^^ 
it is easy to see that 

f^Lf= E^^i[/(^i)-M)]'- (60) 

Vir^Vj 

The right-hand side of this equation is a natural measure of the variation of / across edges, 
showing that in some well defined sense the Laplacian captures the smoothness of functions 
on graphs. In fact, it can be shown that the regularization scheme implied by the diffusion 
kernel is just the discretized, graph-adapted cousin of the the regularization theory behind 
the familiar Gaussian kernel (Smola and Kondor, 2003). 

In applications it often makes sense to modify the above picture somewhat and instead 
of L use the normalized Laplacian L, mostly because of an important result stating that the 
eigenvalues of the latter are bounded between and 2 (Chung-Graham, 1997). Breaking 
away from the original diffusion interpretation, but still adhering to the spectral graph 
theory dogma that the eigenvectors vq, vi, . . . , v^_i of L in some sense capture the principal 
directions of variation of functions on G, the boundedness of the eigenvalues Aq, Ai, . . . , A^^-i 
allow us to construct a whole family of possible kernels of the form 

K = X^(r(A,))-i V, v7, (61) 

i=0 

the only restriction on r being that it must be positive and increasing on [0,2]. A variety 
of such functions is presented by Smola and Kondor (2003). 

The connection between the adjacency matrix of the direct product graph and simul- 
taneous random walks, presented in Section 2, raises the intriguing possibility that the 
Laplacian might also be useful in defining kernels on graphs, as opposed to just graph ver- 
tices. In particular, replacing Wx in (8) by L®L yields an alternate similarity measure 
between graphs. If L L were the Laplacian of the direct product graph, then this would 
amount to computing the expectation of the diffusion kernel on the nodes of the product 
graph under the distribution vec(gxPx)- 

Unfortunately, the Laplacian of the product graph does not decompose into the Kro- 
necker product of the Laplacian matrices of the constituent graphs, that is, Lx 7^ L0L . 
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Figure 5: Two graphs (top left & right) and their Cartesian product (bottom). Each node 
of the Cartesian product graph is labeled with a pair of nodes; an edge exists 
in the Cartesian product if and only if the corresponding nodes are identical in 
one and adjacent in the other original graph. For instance, nodes 31^ and 32^ are 
adjacent because they refer to the same node 3 in the first graph, and there is an 
edge between nodes V and 2' in the second graph. 



Therefore, replacing Wx by L L leads to a vahd p.s.d. kernel, but then we lose the physical 
interpretation relating this to a diffusion process on a product graph. This can be rectified 
by employing the Cartesian product graph instead. 

7.2 Cartesian Product Graph Kernels 

Given two graphs G{V,E) and G^{V\E^){with \V\ = n and \V^\ = n^), the Cartesian 
product G\j (Imrich and Klavzar, 2000) is a graph with vertex set 

Va = {ivi,vi,):vieV,vi,eV'}, (62) 

and edge set 

^D = {((^i^^iO, i^j^j')) • ^i = ^j and {v-;,Vy)eE' V v-; = Vy and {vi,Vj)eE} (63) 
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(c/. Figure 5). It is easy to verify that An = A0A and Ln = L©L , where © is the 
Kronecker sum (3). One can now write an analogue of (8) for the Cartesian product graph: 
Given the weight matrix W\j := $(L) ©$(L'), initial and stopping probability distributions 
PD '•— {p ©y) and q\j := (q ® q^), and an appropriately chosen discrete measure /i, we can 
define a kernel on G and G^ as 

oo 

fc(G,G0:=5^/i(A:)g5H^S^ra. (64) 

k=i 

Letting $(L)^ = I we show in Appendix B that 

Lemma 9 // the measure ii{k) is such that (64) converges, then it defines a valid p.s.d. 
kernel 

Two things are worth noting here: First, we use W^^ instead of W^ as was used in (8). This 
helps us to overcome the technical difficulty that while Wx is a real matrix W\j is a matrix 
in RKHS. Although we will not pursue this avenue here, we note in the passing that one can 
analogously redefine (8) using W^^. Second, we define p\j (analogously q\j) as p0p' instead 
of the Kronecker sum ^(p©y), which would also define a valid probability distribution. The 
problem with the latter formulation is that it would destroy the invariance of the diffusion 
kernel to permutation of graph nodes, and thus render it far less useful. 

7.3 Efficient Computation of Cartesian Product Kernels 

The techniques for efficiently computing random walk graph kernels via direct product 
graphs we introduced in Section 3 are equally applicable to the computation of diffusion- 
based kernels via Cartesian product graphs. In particular, the conjugate gradient (Sec- 
tion 3.2) and fixed-point iteration (Section 3.3) methods can be used without any modifi- 
cation. They will take at most twice as much time on the Cartesian product graph than on 
the direct product graph, due to the lower sparsity of the Kronecker sum vs. the Kronecker 
product of two sparse matrices. 

Our Sylvester equation-based method (Section 3.1) can also be used here: Assume that 
the weight matrix W\j can be written as 

d 
H^□ = ^^Let^ (65) 

1=1 

where the ^L and ^V are label-filtered normalized Laplacian matrices, which are defined 
analogously to label-filtered normalized adjacency matrices (c/. Section 2.1). The problem 
of computing the graph kernel (64) with /i(A:) := A^ can be reduced to the problem of 
solving the Sylvester equation: 

d d 

M = ^ A^L' MI^ + ^ AIa/ M ^L^ + Mq, (66) 

i=i i=i 

where vec(Mo) = p\j. We begin by flattening (66): 

d d 

vec(M) = A ^ vec(i'M I^) + A J^ vec(lA/ M 'L^) + pn- (67) 

1=1 i=i 
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Using (77) we can rewrite (67) as 

d 
(I -X^'L® V) vec(M) = pa (68) 

i=l 

use (65), and solve (68) for vec(M): 

vec(M) = (I -XW^y^pB. (69) 

Multiplying both sides of (69) by q^ yields 

g5vec(M) = q^{I -XWaypo- (70) 

The right-hand side of (70) is the Cartesian product kernel (64). Compared to the 
direct product kernel, the computation will take twice as long because the degree of the 
generalized Sylvester equation (66) is now 2d instead of d. 

7.4 A Deficiency of Diffusion- Based Graph Kernels 

Putting everything together, we can construct diffusion-based graph kernels via the Carte- 
sian product graph, and evaluate them efficiently. However, we found the resulting diffusion- 
based graph kernels to suffer from a troubling deficiency: Recall that Da = ^ • A^j, while 

L = D — A. This means that Vi : [Le]^ = Da — J2j^ij — 0- I^ other words, e is an 
eigenvector of L with zero eigenvalue (and consequently L is rank deficient). Thus for any 

k > we have L e = 0. 

In the absence of any prior knowledge about the data on hand, it is, natural to set the 
initial and stopping probabilities p\-\ resp. q\j to uniform distributions; given graphs G and 
G^ of size n and n\ respectively, we set p\j — Q\J — ^ /(nn^). The above discussion, however, 

implies that then q^L p\j = for all /c > 0, and consequently (64) with W\j = L\j is 
uniformly zero. 

One might be tempted to create non-uniform p\j and q\j based on available properties 
of G and G^ such as the degree of their nodes: p\j ^ Q\J ^ diag(i?) diag(i?^). For every 
such strategy, however, there will be a subclass of graphs (in our example: regular graphs) 
which yields uniform distributions, and whose members are therefore indistinguishable to 
diffusion-based kernels. Breaking this uniformity arbitrarily would conversely destroy the 
permutation invariance of the kernel. 

We are forced to conclude that in order to use diffusion-based graph kernels, we must 
have either a) some prior knowledge about the initial and stopping probabilities on the 
graph, or b) a rich enough feature representation to ensure that the weight matrix W\j is 
not rank deficient. Since none of our datasets satisfy this requirement we do not report 
experiments on diffusion-based graph kernels. 

8. Outlook and Discussion 

As evidenced by the large number of recent papers, random walk graph kernels and marginal- 
ized graph kernels have received considerable research attention. Although the connections 
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between these two kernels were hinted at by Kashima et al. (2004), no effort was made to 
pursue this further. Our aim in presenting a unified framework to view random walk graph 
kernels, marginalized kernels on graphs, and geometric kernels on graphs is to highlight the 
similarities as well as the differences between these approaches. Furthermore, this allows us 
to use extended linear algebra in an RKHS to efficiently compute these kernels by exploiting 
structure inherent in these problems. 

Although rational kernels have always been viewed as distinct from graph kernels, we 
showed that in fact these two research areas are closely related. It is our hope that this will 
facilitate cross-pollination of ideas such as the use of semirings and transducers in defining 
graph kernels. We also hope that tensor and matrix notation become more prevalent in the 
transducer community. 

It is fair to say that R-convolution kernels are the mother of all kernels on structured 
data. It is enlightening to view various graph kernels as instances of R-convolution kernels 
since this brings into focus the relevant decomposition used to define a given kernel, and 
the similarities and differences between various kernels. However, extending R-convolutions 
to abstract semirings does not always result in a valid p.s.d. kernel. 

The links between diffusion kernels and generalized random walk kernels on graphs 
are intriguing. It is fascinating that direct product graphs are linked with random walks, 
but Cartesian product graphs arise when studying diffusion. Surprisingly, all our efficient 
computational tricks from the random walk kernels translate to the diffusion-based kernel. 
As we showed, however, a rank deficiency limits their applicability. Identifying domains 
where diffusion-based graph kernels are applicable is a subject of future work. It is plausible 
that W\j in (64) can be replaced by a spectral function similar to (61). The necessary and 
sufficient conditions for admissible spectral functions r(A) in this case remain an open 
question. 

As more and more graph-structured data (e.g., molecular structures and protein inter- 
action networks) becomes available in fields such as biology, web data mining, etc., graph 
classification will gain importance over the coming years. Hence there is a pressing need to 
speed up the computation of similarity metrics on graphs. We have shown that sparsity, 
low effective rank, and Kronecker product structure can be exploited to greatly reduce the 
computational cost of graph kernels; taking advantage of other forms of structure in Wx re- 
mains a computational challenge. Now that the computation of random walk graph kernels 
is viable for practical problem sizes, it will open the doors for their application in hitherto 
unexplored domains. 

A major deficiency of the random walk graph kernels can be understood by studying 
(13). The admissible values of the decay parameter A is often dependent on the spectrum 
of the matrices involved. What this means in practise is that one often resorts to using 
very low values of A. But a small A makes the contributions to the kernel of higher-order 
terms (corresponding to long walks) negligible. In fact in many applications a naive kernel 
which simply computes the average kernel between all pairs of edges in the two graphs has 
performance comparable to the random walk graph kernel. 

Trying to rectify this situation by normalizing the matrices involved brings to the fore 
another phenomenon called tottering (Mahe et al., 2004). Roughly speaking tottering im- 
plies that short self-repeating walks have a disproportionately large contribution to the 
kernel value. Consider two adjacent vertices v and v^ in a graph. Because of tottering 
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contributions due to walks of the form v ^ v' ^ v ^ . . . dominate the kernel value. Unfor- 
tunately a kernel using self-avoiding walks (walks which do not visit the same vertex twice) 
cannot be computed in polynomial time. 

We do not believe that the last word on graph comparison has been said yet. Thus far, 
simple decompositions like random walks have been used to compare graphs. This is mainly 
driven by computational considerations and not by the application domain on hand. The 
algorithmic challenge of the future is to integrate higher-order structures, such as spanning 
trees, in graph comparisons, and to compute such kernels efficiently. 
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Appendix A. Extending Linear Algebra to RKHS 

It is well known that any symmetric, positive definite kernel a^ : X x X ^ W has a corre- 
sponding Hilbert space H (called the Reproducing Kernel Hilbert Space or RKHS) and a 
feature map (j) : X ^ TC satisfying hc(x,x^) = {(j){x)^(j){x'))^. The natural extension of this 
so-called feature map to matrices is $: X"^^"^ ^ j^nxm (jeg^ed [$(A)]ij := 0(^ij). In what 
follows, we use $ to lift tensor algebra from X to TC, extending various matrix products to 
the RKHS, and proving some of their their useful properties. Straightforward extensions 
via the commutativity properties of the operators have been omitted for the sake of brevity. 

A.l Matrix Product 

Definition 10 Let A G A'^''^, B G X'^''^, andC e R'^''^. The matrix products^ (A) ^{B) G 
W'P and $(A) C G H'^^'P are given by 

[$(A)$(5)],, := Yl (^(^^^■)' ^(^Jk))n ^^d [^(^) C]^k := Y. ^(^^^■) ^3^' 



It is straightforward to show that the usual properties of matrix multiplication — namely 
associativity, transpose-commutativity, and distributivity with addition — hold for Defini- 
tion 10 above, with one exception: associativity does not hold if the elements of all three 
matrices involved belong to the RKHS. In other words, given A G X^^^^ B G X^^^^ and 
C G XP''^ in general [$(A)$(5)]$(C) ^ $(A)[$(S)$(C)]. The technical difficulty is that 

(0(A,,), HBjk))n ^(Cki) + ^{M,) (0(5,,), ^{Cu))n • (71) 

Further examples of statements like (71), involving properties which not hold when extended 
to an RKHS, can be found for the other matrix products at (73) and (79) below. 

Definition 10 allows us to state a first RKHS extension of the vec(ABC) formula (1): 

Lemma 11 // A G R^'^^, B G A'^^^, and C G W"^, then 

vec(A$(S)C)) = (C^0A)vec($(5)) G A'^^^^ 

Proof Analogous to Lemma 13 below. ■ 



A. 2 Kronecker Product 

Definition 12 Let A G X'''''^ and B G XP"""^. The Kronecker product $(A) $(S) G 
'^"^^ is defined as 

[$(A) $(5)](,_i)^+fc^(,_i),+^ := {(l){Aj),cl){Bki))^ . 
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Similarly to (71) above, for matrices in an RKHS 

* ($(A) $(5))($(C) $(£>)) = ($(A) $(C)) ($(S) $(D)) (72) 

does not necessarily hold. The technical problem with (72) is that generally 

{(f>{Air), <i>{Bks))n (•/'(Cri), <t>{Dsi))n + k^{Mr). <t>{Crj))n {<t>{Bks), <t>{Dsi))n ■ (73) 

In Section A. 3 we show that analogous properties (Lemmas 15 and 16) do hold for the 
heterogeneous Kronecker product between RKHS and real matrices. 

Definition 12 gives us a second extension of the vec(ABC) formula (1) to RKHS: 

Lemma 13 If A ^ A'^><^, B G M^""^, and C G A'^''^ then 

vec($(A) B $(C)) = ($(C)^0 $(A)) vec(5) G ^'^^''^ 



Proof We begin by rewriting the k^^ column of <^[A)B^{C) as 



*J 



= [4>{Cik)HA)A{C2k)HA), ■ . . 4>{Cnk)HA)] 



B^2 



vec(B) 



= mCik)AiC2k), . ■ ■ <t>{Cnk)] ® HA)) vec(S). 
To obtain Lemma 13 we stack up the columns of (74): 



vec($(^)S$(C)) 



/ 



•/-(Cii) (/.(C21) ... </.(C„i) 



(t^iCln) (t>{C2n) ... (t>{Cnn) 

($(C)^®$(A))vec(S). 



HA) 



vec{B) 



(74) 



Direct computation of the right-hand side of Lemma 13 requires nmpq kernel evaluations; 
when 771, p, and q are all 0{n) this is O(n^). If TC is finite-dimensional, however — in other 
words, if the feature map can be taken to be 0: X ^M.^ with d < oc — then the left-hand 
side of Lemma 13 can be obtained in 0{n^d) operations. Our efficient computation schemes 
in Section 3 exploit this observation. 
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A. 3 Heterogeneous Kronecker Product 

Definition 14 Let A G ;f"><'" and B G M^^*. The heterogeneous Kronecker product ^^ [A) (, 
B G A"^P^"^« is given by 

Recall that the standard Kronecker product obeys (2); here we prove two extensions: 

Lemma 15 If A e A'"^'", B G A'^^*, C G R""""", and D G M'^''^ then 
($(A) O $(S))(C ® D) = ($(A) C) ($(B) D). 

Proof Using the linearity of the inner product we directly verify 

r,s 

= /j2cj){Air)Crj,J2'l'(^ks)Dsl\ 
\ r s I ■}{ 

= mA)C],,,MB)D]ki)^ 

= [{^A) C) ® ($(B)D)](,_i)p+fc,(,_i),+; 



Lemma 16 If A ^ A'"^'", B G W'i , C G X"^""" , and D G W\ then 
($(A) O S)($(C) ®D) = ($(A) $(C)) (B D). 

Proof Using the linearity of the inner product we directly verify 

[{^{A) ® B){^{C) (g> D)]^i_^p+k,{j-l)q+l = Yl {^{Ar)Bks, ct>{Crj)D,i)^ 

= Y ('/'(^ir), 0(ai))w Y. ^ksDsl 
r s 

^mA)^{C)]ij[BDU 

= [($(A) $(C)) ® (i?D)](i_i)p+fc,(,_i),+; 



Using the heterogeneous Kronecker product, we can state four more RKHS extensions of 
the vec-ABC formula (1): 
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Lemma 17 If A e A'^^^, B G W^^'p, and C G W"^, then 

Yec{^{A)BC) = (C^0$(A))vec(S) G A'^^^^ 

Proof Analogous to Lemma 13. ■ 

Lemma 18 If A e W^'''^, B G W^''^, and C G A'^''^ then 

Proof Analogous to Lemma 13. ■ 

Lemma 19 If A e X'''''^ , B G X'^'^tp , and C G W^^ , then 

vec($(A) $(S) C) = (C^0 $(A)) vec($(S)) G M^^^^ . 

Proof Analogous to Lemma 13. ■ 

Lemma 20 If A ^ W""^ , B G X'^''^ , and C G X^'^'i , then 

vec(A $(S) $(C)) = ($(C)^0 A) vec($(S)) G M^^^^ . 

Proof Analogous to Lemma 13. ■ 

A. 4 Kronecker Sum 

Unlike the Kronecker product, the Kronecker sum of two matrices in an RKHS is also an 
matrix in the RKHS. From Definition 1 and (3) we find that 

[A © B](^i_i-)pj^j.^^j_i-)qj^i := Aij5ki + 5ijBki. (75) 

We can extend (75) to RKHS, defining analogously: 

Definition 21 Let A G X'''''^ and B G X^'^'i . The Kronecker sum $(A) ©$(S) G A'^^x^^ 
is defined as 

[$(A) © $(S)](,_i)^+^^(^-_i)^+^ := ^{A^j)5ki + 5^J^{Bkl). 
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In other words, in an RKHS the Kronecker sum is defined just as in (3): 

$(A) e $(S) = $(A) Ib + Ia $(S), (76) 

where Im denotes the real- valued identity matrix of the same dimensions (not necessarily 
square) as matrix M. In accordance with Definition 14, the result of (76) is an RKHS 
matrix. 

The equivalent of the vec-ABC formula (1) for Kronecker sums is: 

{A B) vec(C) = (A Ib + Ia ®B) vec(C) 

= (A Ib) vec(C) + (Ia 05) vec(C) 

= vec(lB CA^) + Yec{BC I J) (77) 

= Yec{lBCA~^^BClJ). 

This also works for matrices in an RKHS: 

Lemma 22 If A e A'^><^, B G A'^''^ and C G A'^^^, then 

($(A) $(5)) vec($(C)) = vec(lB $(C) $(A)^+ $(5) $(C) ij) G R^^^^ . 

Proof Analogous to (77), using Lemmas 19 and 20. ■ 

Furthermore, we have two valid heterogeneous forms that map into the RKHS: 

Lemma 23 If A e A'^^^, B G A'^''^ and C G M^'''^, then 

($(A) $(S)) vec(C) = vec(lB C $(A)^+ $(S) ClJ) G A'^^^^ . 

Proof Analogous to (77), using Lemmas 17 and 18. ■ 

Lemma 24 If A e W""^, B G W"^, and C G Af^^^, t/ien 

(A B) vec($(C)) = vec(lB $(C) A'^ + B $(C) ij) G A'^^^^ . 

Proof Analogous to (77), using Lemma 11. ■ 

A. 5 Hadamard Product 

While the extension of the Hadamard (element-wise) product to an RKHS is not required 
to implement our fast graph kernels, the reader may find it interesting in its own right. 
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Definition 25 Let A,B e A"^^™ and C G M"^™. The Hadamard products ^A) $(B) G 
M"^*" and $(A) C e 7^"'^"' are 5i?;en hy 

[^{A) $(S)]i,- = {4>{Aij), (PiBij))^ and [^A) C]ij = 0(Ai,-) C^,-. 
We prove two extensions of (4) : 

Lemma 26 If A e A'"^'", B G A'^^'?, C £ R"^"", and D £ RP^«, then 

($(A) $(B)) (C D) = ($(A) C) ($(S) D). 

Proof Using the linearity of the inner product we directly verify 

[($(A) $(S)) (C D)]^i_i)p+k,{j-i)q+i = {HAij),HBki))H CijDki 

= {<P{Aj)Cij,(p{Bki)Dki)-^ 
= {MA)QC]ij,MB)QD]ki)^ 

= [($(A) C) ($(B) D)](i_i)p+fc,y_i),+; 



Lemma 27 If A e ;f"^'", S G M^''^ C G ;f"^'", and D G M^''^ t/ien 

($(A) B) ($(C) D) = ($(A) $(C)) (S D). 

Proof Using the linearity of the inner product we directly verify 

[($(A) B) ($(C) D)](i_i)p+fc,(,_i),+; = (,/.(^i,)Sfci,,/.(ai)l?w)„ 

= {(f){Aij), (f){Cij))^ BkiDki 
^MA)Q^{C)]ij[BQD]ki 

= imA) $(C)) (S D)](i_i)p+fc,y_i),+i 

■ 

As before, 

* ($(A)0 $(5)) 0($(C)0$(Z))) = ($(A)0$(C))0($(S)0 $(£>)) (78) 

does not necessarily hold, the difficulty with (78) being that in general, 

{(t>{Aij), (t>{B,i))^ {(f>{Cij), (f>{Dki))n + (</'(^ii), </'(C,,))^ (0(i?w), '^iPu))u ■ (79) 
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Appendix B. Cartesian Product Kernels: Proof of Lemma 9 

We first prove the following technical lemma: 

Lemma 28 V fe G N : W^pa = J2[-) vec[$(L')''"V ($(L)V)^]. 

i=o ^^^ 

Proof By induction over k. Base case: k = 1. Recall that (J) = (j = 1. Using 
PO — P'S^p' — vec{p'p~^), the definition of W\j, and Lemma 23, we have 

Wapn = i^{L) e $(L')) vec(p'p^) 

= vec(p'p^$(L)^ + ^{L')p'p'^) 



= E ( •) vec[$(L')'=-V($(L)V)" 



Induction from k to k -\- 1: Using the induction assumption and Lemma 23, we have 

= J2 (^) vec[$(L')'="V miy+^py + $(L')'^"^+ V (*(i^)V)^] 
i=o ^^^ 



fc+i 

/ /i* \ .... . ^ 

(80) 



^ . _ veci^lLO'^-^^ V (*(i^yp)^] 
+ E Q) vec[$(L')'=-^+V($(^)V)^], 



i=0 

where j \— i ^ 1. Pulhng out the terms for j = A: + 1 and z = 0, we can write (80) as 
W^^PU = vec[p^($(L)^+V)^]+vec[$(LO^+Vp^] 



+ E 



'^'.-X: 



vec[$(L')''"'+ V (*(i^)» ' ] ■ (81) 



Using the well-known identity {/l ) — (j_i) + (j) (e-(?-, Abramowitz and Stegun, 1965, 
Section 24.1.1), and {^^^) = (^+J) = 1, we can finally rewrite (81) to yield 



W^^^pu = (^;^^)vec[$(LO'=+Vp^] + Q;^J)vecb'($(L)'=+M^] 
+ E f ^ ^ "^1 vec[$(L')'^-^+V ($(^)V)^] 
= E( I yec[^{L'f^^-'p'{HL)Wi (82) 
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which has the form required by the induction. 
We are now positioned to prove Lemma 9: 
Proof Using Lemmas 23 and 28, we have 

2/c 



= E f^^l vec[g'^$(LO^'=-V($(L)V)\] 

2/c 

Y, {^^) {qMLyvV {q'MLT-v') . (83) 



'"^ P(G)^ P{G') 



Each individual term of (83) equals piG)^ p{G') for some function p, and is therefore a 
valid p.s.d. kernel. Because the class of p.s.d. kernels is closed under non-negative linear 
combinations (Berg et al., 1984), the above sum of kernels with positive coefficients (^^ ) is 
a valid p.s.d. kernel. 
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